Abstract

Text categorization refers to the automatic labelling of documents into one or more pre-defined
categories, based on the natural language text contained in or associated with each document.
Today, image categorization is a necessity because of the very large number of image documents
that we deal with daily. Current image categorization systems classify images using their
associated text. We propose a new approach to automatic image categorization on Android mobile
devices: an application that classifies document images based on their contents, which is useful
to business people, teachers and students. The classification module is the primary module. OCR
technology extracts the textual contents from the input images. The extracted text is given as
input to the classification module, which automatically classifies the images based on hashing
techniques. The searching module searches for relevant image documents based on a user keyword.
The Android interface lets the end-user search the database for relevant images easily and
efficiently. Our literature survey leads to the conclusion that text mining is a good and
promising strategy for automatic image categorization.

Keywords — Text Mining, Information Retrieval, Data Mining, OCR, Text classification, Feature 
extraction 
I. Introduction



Our objective is to use the visual capabilities of the Android mobile phone to extract text
from an input image. We use the device camera to capture the image.

Any camera image of a document is subject to several environmental conditions, such as
variable lighting, reflection, rotation, and scaling, among others. Ideally, the same data should
be extracted from an image regardless of its distance from the camera.

Text classification attempts to associate a text with a given category based on its content. Text
categorization is the task of automatically sorting documents into categories from a predefined
set. In our work, we propose evaluating a larger number of features.



Text Mining (TM) is a new, challenging and multi-disciplinary area that draws on fields of
knowledge such as Computing, Statistics, Predictive, Linguistics and Cognitive Science. TM has
been applied to a variety of problems. Some applications are summary creation, clustering,
language identification, term extraction and categorization, electronic mail management,
document management, and market research and investigation.

TM consists of extracting regularities and patterns from, and categorizing, large volumes of
text written in a natural language. NLP is therefore used to process such text by segmenting it
into its constituent parts for further processing.
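
As a rough illustration of the segmentation step, the following Python sketch splits raw text into sentences and then into lowercase word tokens. The function name `segment` and both regular expressions are illustrative assumptions, not part of the system described in this paper.

```python
import re


def segment(text):
    """Split text into sentences, then each sentence into lowercase word tokens."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a token is a run of letters, digits or apostrophes.
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]


if __name__ == "__main__":
    for tokens in segment("OCR extracts text. The text is then classified!"):
        print(tokens)
```

Text produced by OCR usually contains recognition noise, so a practical tokenizer would likely need additional cleaning rules beyond this sketch.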

II. Problem Formulation

A. Problem Definition -

Given an input textual image, the system uses a web service to extract the textual features
from the image, which are then used to classify the image automatically using data mining
techniques. The classified images can then be searched easily by keyword. The relevant images
and their respective text are retrieved for the end-user as the end result.



B. Objective - 

The main objective of the ScanDroid system is to provide the user with an efficient image
retrieval tool that allows the end-user to search the database based on the content of the
input textual images.
III. Mathematical Model

We now provide a model of the system in terms of set theory.



1. Let 'S' be the system for content-based classification and retrieval of textual images.

S = { ...



2. Identify the inputs as I.
S = {I, ...

where I = {I1, I2}

I1 = {i | 'i' is an image file format from which text can
be extracted} = {*.jpg, *.bmp, *.tiff, *.png}

I2 = {t | 't' is the location of an input image file on the phone
memory}



3. Identify the processes as P.
S = {I, O, P, ...
P = {Ex, Fe, Cf, R}



4. Ex is the set of extraction module activities.

Ex = {Ei, Ep, Eo}

■ Ei = {f | 'f' is a valid image file input to the extraction module}

■ Ep = {f | 'f' is the extraction function that converts Ei to Eo}

Ep (Ei) = Eo

■ Eo = {f | 'f' is the output generated by the extraction module, i.e. a text document file}



5. Fe is the set of feature extraction module activities.
Fe = {Fi, Fp, Fo}

■ Fi = {f | 'f' is a valid input text document to the feature extraction module}

■ Fp = {f | 'f' is the feature extraction function that converts Fi to Fo}

Fp (Fi) = Fo

■ Fo = {f | 'f' is the output generated by the feature extraction module, i.e. a character sequence}
6. Cf is the set of classification module activities.
Cf = {Ci, Cp, Co}

■ Ci = {f | 'f' is a feature passed to the classification module}

■ Cp = {f | 'f' is the classification function that converts Ci to Co}

Cp (Ci) = Co

■ Co = {f | 'f' is the output generated by the classification module}



7. R is the set of search and retrieval activities and associated data.
R = {Ri, Rp, Ro}

■ Ri = {r | 'r' is the keyword query}

■ Rp = {r | 'r' is the searching and retrieval function}

Rp (Ri) = Ro

■ Ro = {r | 'r' is the metadata of a matching image file}



8. Identify failure cases as F.

S = {I, O, P, F, ...
where F = {F1, F2, F3}
Failure occurs when -

■ F1 = {l | 'l' is an image containing no text}

■ F2 = {p | 'p' is a keyword with no match in the domain dictionary}

■ F3 = {m | 'm' is an improper image}



9. Identify success cases (terminating cases) as E.

S = {I, O, P, F, E, ...
where E = {E1, E2, E3}
Success is defined as -

■ E1 = {p | 'p' is an image retrieved properly based on the keyword}

■ E2 = {q | 'q' is an image that is properly classified}

■ E3 = {r | 'r' is a text file created successfully}
Mathematical Representation

Let 'S' be the system -

S= {I, Ei, Fi, Ci, Ri, Ep, Fp, Cp, Rp, Eo, Fo, Co, Ro, F, E} 



where, 

I = set of valid input data, where I = {I1, I2}.

I1 = set of valid image formats.

I2 = set of image locations on phone memory.
Ei = set of valid image files to text extraction module. 
Fi = set of valid input text document to feature extraction 
module. 

Ci = set of features extracted to classification module. 

Ri = set of input to image retrieval module. 

Ep = set of text extraction module function. 

Fp = set of feature extraction module function. 

Cp = set of classification function to convert the Ci to Co. 

Rp = set of image retrieval module functions. 

Eo = set of output of text extraction module. 

Fo = set of output of feature extraction module.

Co = set of output generated by classification module. 

Ro = set of output of image retrieval module. 

F = set of failure cases.

E = set of success cases.
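
Read as a data flow, the model chains the process functions: Ep(Ei) = Eo feeds Fp(Fi) = Fo, which feeds Cp(Ci) = Co, and the failure cases F1, F2 and F3 interrupt that chain. The Python sketch below mirrors this structure under stated assumptions: `ep` receives already-recognised text instead of calling a real OCR engine, `DOMAIN_DICTIONARY` and the category names are made-up examples, and the retrieval function Rp is left out.

```python
import os

VALID_FORMATS = {".jpg", ".bmp", ".tiff", ".png"}   # I1 from the model
# Hypothetical domain dictionary: keyword -> category (not from the paper).
DOMAIN_DICTIONARY = {"invoice": "business", "exam": "education", "lecture": "education"}


class NoTextFound(Exception):
    """Failure case F1: the image contains no text."""


class NoMatchingKeyword(Exception):
    """Failure case F2: no keyword matches the domain dictionary."""


class ImproperImage(Exception):
    """Failure case F3: the image is not in a valid format."""


def ep(path, recognised_text):
    """Ep: Ei -> Eo. Stand-in for OCR; the recognised text is passed in directly."""
    if os.path.splitext(path)[1].lower() not in VALID_FORMATS:
        raise ImproperImage(path)
    if not recognised_text.strip():
        raise NoTextFound(path)
    return recognised_text


def fp(text):
    """Fp: Fi -> Fo. In this sketch the features are simply lowercase words."""
    return text.lower().split()


def cp(features):
    """Cp: Ci -> Co. Return the category of the first word found in the dictionary."""
    for word in features:
        if word in DOMAIN_DICTIONARY:
            return DOMAIN_DICTIONARY[word]
    raise NoMatchingKeyword(features)


def classify(path, recognised_text):
    """Compose Cp(Fp(Ep(image))), following the order of the model."""
    return cp(fp(ep(path, recognised_text)))


if __name__ == "__main__":
    print(classify("/sdcard/scan1.jpg", "Invoice number 42"))
```

Modelling each failure case as its own exception keeps the success path a plain function composition, which matches the set-theoretic description.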
Fig. 1: Basic Flow Diagram of ScanDroid App



IV. Framework

The basic framework of the ScanDroid system consists of two major categories, namely
Classification and Retrieval.



1. Classification 

It includes 3 phases; the end result is a classified image.

■ Phase 1 (Pre-processing Phase) performs text extraction using OCR.
■ Phase 2 (Tokenization Phase) creates the tokens required for classification. This process
includes the removal of stop words and the reduction of words to their root forms, which is
known as stemming.

■ Phase 3 (Classification Phase) assigns categories based on the keywords from the feature
extraction module.

2. Retrieval

It includes 2 phases; the end result is the respective image.

■ Phase 1 (Lookup Phase) takes an input keyword and compares it with all the indexed items in
the table to get the relevant matches.

■ Phase 2 (Retrieval Phase) analyses all the matched keywords and returns the respective image(s).

V. Algorithm

The application follows two algorithms: the naive Bayes classifier algorithm and the searching
algorithm.

Naive Bayes Algorithm

The naive Bayes classifier has been successfully used in the Rainbow text classification
system. Let $C = (c_1, \ldots, c_m)$ be $m$ document classes. Given a new unlabelled document $D$
and its corresponding word-list $W = (\omega_1, \ldots, \omega_d)$ (defined in the same way as the
word-list for the training set), the naive Bayes approach assigns $D$ to a class $c_{NB}$ as
follows:

$$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i=1}^{d} P(\omega_i \mid c_j)$$
where $P(c_j)$ is the a priori probability of class $c_j$ and $P(\omega_i \mid c_j)$ is the
conditional probability of word $\omega_i$ given class $c_j$. The naive Bayes assumption is that
the probabilities of words occurring in a document are independent of each other.

When the size of the training set is small, the relative frequency estimates of the
probabilities $P(\omega_i \mid c_j)$ are not reliable. In particular, if a word never appears in
the training data for a class, its relative frequency estimate is zero, which forces the whole
product for that class to zero.



The estimate of the probability $P(\omega_i \mid c_j)$ is given as:

$$P(\omega_i \mid c_j) = \frac{n_{ij} + 1}{n_j + k_j}$$

where $n_j$ is the total number of words in class $c_j$, $n_{ij}$ is the number of occurrences of
word $\omega_i$ in class $c_j$, and $k_j$ is the vocabulary size of class $c_j$. This is the result
of Bayesian estimation with a uniform prior assumption, i.e. the words appearing in class $c_j$
are assumed to be equally likely.
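
The decision rule and the smoothed estimate above can be sketched in a few lines of Python. This is a minimal illustration rather than the ScanDroid implementation: the function names `train` and `classify` and the data layout are assumptions, and the product of probabilities is replaced by a sum of logarithms, which selects the same class while avoiding numerical underflow on long word-lists.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """labelled_docs: list of (word_list, class_label) pairs."""
    doc_counts = Counter()               # documents per class, used for P(c_j)
    word_counts = defaultdict(Counter)   # n_ij: occurrences of word i in class j
    for words, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(words)
    return doc_counts, word_counts


def classify(words, doc_counts, word_counts):
    """Return the class maximising log P(c_j) + sum_i log P(w_i | c_j)."""
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        n_j = sum(word_counts[label].values())   # total words in class j
        k_j = len(word_counts[label])            # vocabulary size of class j
        score = math.log(doc_counts[label] / total_docs)
        for w in words:
            n_ij = word_counts[label][w]         # 0 if the word was never seen
            score += math.log((n_ij + 1) / (n_j + k_j))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    training = [
        (["invoice", "payment", "due"], "business"),
        (["meeting", "invoice", "client"], "business"),
        (["exam", "syllabus", "lecture"], "education"),
        (["lecture", "notes", "exam"], "education"),
    ]
    model = train(training)
    print(classify(["invoice", "client", "notes"], *model))
```

Because of the added 1 in the numerator, a word that never occurred in a class contributes a small non-zero probability instead of eliminating that class outright.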
Fig. 2: Overview of Text Mining



Content-Based Searching Algorithm



The searching algorithm performs its lookups in a hash table. The search query is given as
input to the system and compared with the keys of the hash table, which holds all the extracted
documents and their relevant text. The files containing the query are returned to the list view
of the GUI, where the user can view the retrieved files.
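
A minimal sketch of such a lookup is shown below, using a Python dictionary as the hash table. The index layout (keyword mapped to the set of image files whose extracted text contains it) and the names `build_index` and `search` are assumptions made for illustration; the paper does not specify the key structure.

```python
from collections import defaultdict


def build_index(extracted):
    """extracted: dict mapping image file name -> text extracted from that image."""
    index = defaultdict(set)   # hash table: keyword -> image files containing it
    for filename, text in extracted.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(filename)
    return index


def search(index, query):
    """Return the image files whose extracted text contains the query keyword."""
    return sorted(index.get(query.lower(), set()))


if __name__ == "__main__":
    index = build_index({
        "scan1.jpg": "Invoice for the client.",
        "scan2.png": "Lecture notes, exam syllabus",
        "scan3.jpg": "Client meeting notes",
    })
    print(search(index, "client"))
```

Each lookup costs one hash computation on average regardless of how many images are stored, which suits a keyword search running on a mobile device.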
VI. Expected Result

The expected output of the system is a set of classified images, obtained after text mining
techniques are applied to the textual images.

Another expected output is the respective image(s) retrieved for the keyword given as input to
the system.

The user may click the text file that is retrieved along with the image to view the textual
contents of that particular image file.



VII. Conclusion



We have presented a framework for the automatic classification of document images on Android
mobile devices. We have also formulated and described a mathematical model of the system.